In this part of the course, we will cover the following concepts:
| Objective | Complete |
|---|---|
| Load and evaluate the dataset | |
| Address missing values in data |
Data scientists spend the vast majority of their time preparing data for analysis using an array of data wrangling techniques
You might also hear these processes referred to as:
Data wrangling refers to a variety of methods to transform data from its current state into a format suitable for analysis and processing
Both manual and automatic processes contribute to data wrangling
ChemicalManufacturingProcess from AppliedPredictiveModeling package in Rbox package and encode your directory structure into variablesmain_dir be the variable corresponding to your materials folderdata directory inside of the materials folder in your environment, so we’ll save its path to a data_dir variableplots directory corresponding to plot_dir variablepaste0 command and pass the strings you would like to paste togetherdata_dir into R’s environment# Read CSV file called "ChemicalManufacturingProcess.csv"
CMP = read.csv(file = file.path(data_dir,"/ChemicalManufacturingProcess.csv"), #<- provide file path
header = TRUE, #<- if file has header set to TRUE
stringsAsFactors = FALSE) #<- read strings as characters, not as factorsIn this module we will explore a subset of this data set, which includes the following variables
…
# Let's make a vector of column indices we would like to save.
column_ids = c(1:4, #<- concatenate a range of ids
14:16)#<- with another a range of ids
column_ids #<- verify that we have the correct set of columns[1] 1 2 3 4 14 15 16
# Let's save the subset into a new variable and look at its structure.
CMP_subset = CMP[ , column_ids]
str(CMP_subset)'data.frame': 176 obs. of 7 variables:
$ Yield : num 38 42.4 42 41.4 42.5 ...
$ BiologicalMaterial01 : num 6.25 8.01 8.01 8.01 7.47 6.12 7.48 6.94 6.94 6.94 ...
$ BiologicalMaterial02 : num 49.6 61 61 61 63.3 ...
$ BiologicalMaterial03 : num 57 67.5 67.5 67.5 72.2 ...
$ ManufacturingProcess01: num NA 0 0 0 10.7 12 11.5 12 12 12 ...
$ ManufacturingProcess02: num NA 0 0 0 0 0 0 0 0 0 ...
$ ManufacturingProcess03: num NA NA NA NA NA NA 1.56 1.55 1.56 1.55 ...
To get quick summary statistics for a vector, a data frame, or a subset of your data frame, use summary()
For numeric data types, summary() will output:
Yield BiologicalMaterial01 BiologicalMaterial02 BiologicalMaterial03
Min. :35.25 Min. :4.580 Min. :46.87 Min. :56.97
1st Qu.:38.75 1st Qu.:5.978 1st Qu.:52.68 1st Qu.:64.98
Median :39.97 Median :6.305 Median :55.09 Median :67.22
Mean :40.18 Mean :6.411 Mean :55.69 Mean :67.70
3rd Qu.:41.48 3rd Qu.:6.870 3rd Qu.:58.74 3rd Qu.:70.43
Max. :46.34 Max. :8.810 Max. :64.75 Max. :78.25
ManufacturingProcess01 ManufacturingProcess02 ManufacturingProcess03
Min. : 0.00 Min. : 0.00 Min. :1.47
1st Qu.:10.80 1st Qu.:19.30 1st Qu.:1.53
Median :11.40 Median :21.00 Median :1.54
Mean :11.21 Mean :16.68 Mean :1.54
3rd Qu.:12.15 3rd Qu.:21.50 3rd Qu.:1.55
Max. :14.10 Max. :22.50 Max. :1.60
NA's :1 NA's :3 NA's :15
| Objective | Complete |
|---|---|
| Load and evaluate the dataset |
✔ |
| Address missing values in data |
max()# Let's try and compute the maximum value of the 1st manufacturing process.
max_process01 = max(CMP_subset$ManufacturingProcess01)
max_process01[1] NA
[1] 14.1
na.rm = TRUE to ignore NA valuesWhat if the function being used does not have the method na.rm? Or what if removing NAs skews the results?
Data imputation with one of the following values will help to overcome this:
0meanmedianReplacing NAs with mean may not work well if the data contains outliers
is.na will provide a vector of TRUE or FALSE for each element of a given vectorNA for datasets containing even a moderate number of data points# Let's take a look at `ManufacturingProcess01`
# and see if any of the values in it are `NA`.
is.na(CMP_subset$ManufacturingProcess01) [1] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[97] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[109] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[121] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[145] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[157] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[169] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
which function is an invaluable utility function in R’s base packagelogical values OR a vector/array of any values with a comparison statement in one of the comparison operators (==, !=, >, <, >=, <=) and a value to which we are comparing toindices of all TRUE values of the logical vector, or the indices of all the values that meet the condition we specifiedwhich() to identify the index of all values that are TRUE# Let's save this vector of logical values to a variable.
is_na_MP01 = is.na(CMP_subset$ManufacturingProcess01)
# To determine WHICH elements in the vector are `TRUE`and are NA, we will use `which` function.
# Since we already have a vector of `TRUE` or `FALSE` logical values
# we only have to give it to `which` and it will return all of the
# indices of values that are `TRUE`.
which(is_na_MP01)[1] 1
[1] 1
ManufacturingProcess01 is NA, it can be selected programmatically without having to type its index manually using bracket syntax[1] 1
[1] NA
NAmean of the variable as a replacement# Compute the mean of the `ManufacturingProcess01`.
mean_process01 = mean(CMP_subset$ManufacturingProcess01)
mean_process01[1] NA
na.rm = TRUE in order to compute the mean of the variable that contains NAs# Compute the mean of the `ManufacturingProcess01` and set `na.rm` to `TRUE`.
mean_process01 = mean(CMP_subset$ManufacturingProcess01, na.rm = TRUE)
mean_process01[1] 11.20743
# Assign the mean to the entry with the `NA`.
CMP_subset$ManufacturingProcess01[na_id] = mean_process01
CMP_subset$ManufacturingProcess01[na_id][1] 11.20743
[1] 14.1
# Impute missing values of `ManufacturingProcess02` with the mean
is_na = is.na(CMP_subset$ManufacturingProcess02)
na_id = which(is_na)
mean_process02 = mean(CMP_subset$ManufacturingProcess02, na.rm = TRUE)
CMP_subset$ManufacturingProcess02[na_id] = mean_process02
# Impute missing values of `ManufacturingProcess03` with the mean
is_na = is.na(CMP_subset$ManufacturingProcess03)
na_id = which(is_na)
mean_process03 = mean(CMP_subset$ManufacturingProcess03, na.rm = TRUE)
CMP_subset$ManufacturingProcess03[na_id] = mean_process03| Objective | Complete |
|---|---|
| Load and evaluate the dataset |
✔ |
| Address missing values in data |
✔ |
You are now ready to try Tasks 1-5 in the Exercise for this topic